models.ldamodel – Latent Dirichlet Allocation

Train the model with new documents, by EM-iterating over the corpus until the topics converge, or until the maximum number of allowed iterations is reached. corpus must be an iterable.

In distributed mode, the E step is distributed over a cluster of machines.
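As a minimal illustration (a sketch only: the toy documents, the Dictionary/doc2bow preprocessing and the num_topics value are assumptions, not part of this reference), the snippet below trains an LdaModel and then continues training it on new documents via update():

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy tokenized documents; any iterable of bag-of-words vectors works.
    old_texts = [["cat", "dog", "pet"], ["dog", "bone", "pet"]]
    new_texts = [["stock", "market", "trade"], ["market", "price", "stock"]]

    dictionary = Dictionary(old_texts + new_texts)
    old_corpus = [dictionary.doc2bow(t) for t in old_texts]
    new_corpus = [dictionary.doc2bow(t) for t in new_texts]

    # Initial training on the old documents.
    lda = LdaModel(corpus=old_corpus, id2word=dictionary, num_topics=2)

    # EM-iterate over the new documents; the old and new models are merged
    # in proportion to the number of old vs. new documents.
    lda.update(new_corpus)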

Notes

This update also supports updating an already trained model (self) with new documents from corpus; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of ‘Online Learning for LDA’ by Hoffman et al. and is guaranteed to converge for any decay in (0.5, 1]. Additionally, for smaller corpus sizes, an increasing offset may be beneficial (see Table 1 in the same paper).

Parameters

corpus (iterable of list of (int, float), optional) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms) used to update the model.

chunksize (int, optional) – Number of documents to be used in each training chunk.

decay (float, optional) – A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to κ (kappa) from ‘Online Learning for LDA’ by Hoffman et al.

offset (float, optional) – Hyper-parameter that controls how much we slow down the first steps of the first few iterations. Corresponds to τ₀ (tau_0) from ‘Online Learning for LDA’ by Hoffman et al.

passes (int, optional) – Number of passes through the corpus during training.

update_every (int, optional) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.

eval_every (int, optional) – Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.

iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.

chunks_as_numpy (bool, optional) – Whether each chunk passed to the inference step should be a numpy.ndarray or not. NumPy can in some settings turn the term IDs into floats; these are converted back into integers during inference, which incurs a performance hit. For distributed computing it may be desirable to keep the chunks as numpy.ndarray.
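
A hedged sketch of how these parameters might be passed together (continuing the lda and new_corpus names from the example above; the numeric values are illustrative placeholders, not recommendations):

    lda.update(
        new_corpus,
        chunksize=2000,         # documents per training chunk
        decay=0.7,              # kappa in (0.5, 1]: how quickly old lambda is forgotten
        offset=64.0,            # tau_0: slows down the first few iterations
        passes=1,               # passes through the corpus
        update_every=1,         # >= 1 for online learning, 0 for batch learning
        eval_every=10,          # estimate log perplexity every 10 updates
        iterations=50,          # max iterations when inferring topic distributions
        gamma_threshold=0.001,  # minimum change in gamma to keep iterating
        chunks_as_numpy=False,  # keep chunks as plain lists unless running distributed
    )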


